Hypothesis test for two high-dimensional mean vectors.
sarabai(x1, x2)
A vector with the test statistic and the p-value.
A matrix containing the Euclidean data of the first group.
A matrix containing the Euclidean data of the second group.
Michail Tsagris.
R implementation and documentation: Michail Tsagris mtsagris@uoc.gr.
High dimensional data are the multivariate data which have many variables (\(p\)) and usually a small number of observations (\(n\)). It also happens that \(p>n\) and this is the case here in this Section. We will see a simple test for the case of \(p>n\). In this case, the covariance matrix is not invertible and in addition it can have a lot of zero eigenvalues.
The test we will see was proposed by Bai and Saranadasa (1996). Ever since, there have been some more suggestions but I chose this one for its simplicity. There are two datasets, \({\bf X}_1\) and \({\bf X}_2\) of sample sizes \(n_1\) and \(n_2\), respectively. Their corresponding sample mean vectors and covariance matrices are \(\bar{{\bf X}}_1\), \(\bar{{\bf X}}_2\) and \({\bf S}_1\), \({\bf S}_2\) respectively. The assumption here is the same as that of the Hotelling's test we saw before.
Let us define the pooled covariance matrix at first, calculated under the assumption of equal covariance matrices \( {\bf S}_n=\frac{\left(n_1-1\right){\bf S}_1+\left(n_2-1\right){\bf S}_2}{n}\), where \(n=n_1+n_2\). Then define \(B_n=\sqrt{ \frac{n^2}{\left(n+2\right)\left(n-1\right)}\left\lbrace\text{tr}\left({\bf S}_n^2\right)- \frac{1}{n}\left[\text{tr}\left({\bf S}_n\right)\right]^2 \right\rbrace }\). The test statistic is $$ Z=\frac{\frac{n_1n_2}{n_1+n_2}\left(\bar{{\bf X}}_1-\bar{{\bf X}}_2\right)^T\left(\bar{{\bf X}}_1-\bar{{\bf X}}_2\right) -\text{tr}\left({\bf S}_n\right)}{\sqrt{\frac{2\left(n+1\right)}{n}}B_n}. $$
Under the null hypothesis (equality of the two mean vectors) the test statistic follows the standard normal distribution. Bai and Saranadasa (1996) established the asymptotic normality of the test statistics and showed that it has attractive power property when \(p/n \rightarrow c < \infty\) and under some restriction on the maximum eigenvalue of the common population covariance matrix. However, the requirement of \(p\) and \(n\) being of the same order is too restrictive to be used in the "large \(p\) small \(n\)" situation.
Bai Z. D. and Saranadasa H. (1996). Effect of high dimension: by an example of a two sample problem. Statistica Sinica, 6(2): 311--329.
hotel2T2, maov, el.test2, eel.test2
x1 <- matrix( rnorm(40 * 100), ncol = 100 )
x2 <- matrix( rnorm(50 * 100), ncol = 100 )
sarabai(x1, x2)
Run the code above in your browser using DataLab